ABSTRACT

Influence maximization is the problem of finding a set of 
users in a social network, such that by targeting this set, 
one maximizes the expected spread of influence in the net- 
work. Most of the literature on this topic has focused ex- 
clusively on the social graph, overlooking historical data, 
i.e., traces of past action propagations. In this paper, we 
study influence maximization from a novel data-based per- 
spective. In particular, we introduce a new model, which 
we call credit distribution, that directly leverages available 
propagation traces to learn how influence flows in the net- 
work and uses this to estimate expected influence spread. 
Our approach also learns the different levels of influence- 
ability of users, and it is time-aware in the sense that it 
takes the temporal nature of influence into account. 

We show that influence maximization under the credit dis- 
tribution model is NP-hard and that the function that de- 
fines expected spread under our model is submodular. Based 
on these, we develop an approximation algorithm for solving 
the influence maximization problem that at once enjoys high 
accuracy compared to the standard approach, while being 
several orders of magnitude faster and more scalable. 

1. INTRODUCTION 

Motivated by applications such as viral marketing [5], personalized
recommendations [15], feed ranking [8], and the analysis of
Twitter [16, 1], the study of the propagation of influence exerted
by users of an online social network on other users has received
tremendous attention in recent years. One of the key problems in
this area is the identification of influential users, by
targeting whom certain desirable
outcomes can be achieved. Here, targeting could mean giv- 
ing free (or price discounted) samples of a product and the 
desired outcome may be to get as many customers to buy 
the product as possible. Kempe et al. [10] formalized this 
as the influence maximization problem: find k "seed" nodes
in the network, for a given number k, such that by activating
them we can maximize the expected influence spread, i.e., the
expected number of nodes that eventually get activated, according
to a chosen propagation model. The propagation model governs
how influence diffuses or propagates
through the network (see Section 2 for background on the 
most prominent propagation models adopted by [10]). Fol- 
lowing this seminal paper, there has been substantial work 
in this area (see Section 2.1). In this paper we study influ- 
ence maximization as defined by Kempe et al., but from a 
novel, data-based perspective. 

Influence maximization requires two kinds of data - a 
directed graph G and an assignment of probabilities (or 
weights) to the edges of G, capturing degrees of influence. 
E.g., in Figure 1, the probability of the edge (v,u) is 0.25 
and it says there is a probability 0.25 with which user v 
influences u and thus v's actions will propagate to u with 
probability 0.25. In real life, while the digraph representing 
a social network is often explicitly available, edge proba- 
bilities are not. Facing difficulties in gathering real action 
propagation traces from which to "learn" edge probabili- 
ties, previous work has resorted to simply making assump- 
tions about these probabilities. The methods adopted for 
assignment of probabilities to edges include the following: 
(i) treating them as constant (e.g., 0.01), (ii) drawing val- 
ues uniformly at random from a small set of constants, e.g., 
{0.1, 0.01, 0.001} in the so-called trivalency "model", or (iii) 
defining them to be the reciprocal of a node's in-degree, in 
the so-called weighted cascade "model" (see e.g., [10, 3, 2]). 
Only recently have researchers shown how to learn the edge
probabilities from real data on past propagation traces of
actions performed by users (nodes) [14, 7].

Given that there have been several ad hoc assumptions 
about probability assignment as well as recent techniques 
for learning edge probabilities from real data, some natu- 
ral questions arise. What is the relative importance of the 
graph structure and the edge probabilities in the influence 
maximization problem? To what extent do different methods
of edge probability assignment accurately describe the influence
propagation phenomenon? In particular, how do the
various edge probability assignments considered in earlier 
literature compare with probabilities learned from real data 
when it comes to accurately predicting the expected influ- 
ence spread? Learning edge probabilities from real data is 
prone to error either owing to noise in the data or to the 
inherent nature of mining these probabilities. How robust 
are solutions to influence maximization against such noise? 

As we will discuss in the next section, the influence max- 
imization process based on Monte Carlo (MC) simulation is 
computationally expensive, even when the edge probabilities 
are given as input. Having to learn these probabilities from
a large database of traces only adds to the complexity. Can
we avoid the costly learning and simulation approach, and
directly mine the available log of past action propagation
traces to build a model of the spread of any given seed set?

Figure 1: The standard influence maximization process (in
light blue), and our approach (in magenta).

Our research is driven by the questions above, and it 
achieves the following contributions. 

• We conduct a detailed empirical evaluation of different
methods of edge probability assignment, as well as of
probabilities learned from real propagation traces, and show
that methods that do not learn probabilities from real
data end up choosing seed sets very different from those
chosen by methods that do. Second, we show that the spread
predicted by methods based on ad hoc edge probability
assignment suffers from large errors compared to methods that
learn edge probabilities from real data. This offers some
evidence that the former class of methods risks choosing
poor-quality seeds (Section 3).

• We develop a new model called credit distribution, built 
on top of real propagation traces that allows us to di- 
rectly predict the influence spread of node sets, without 
any need for learning edge probabilities or conducting 
MC simulations (Section 4). 

• We show that influence maximization under credit
distribution is NP-hard. However, we show the function
defining influence spread under this model is monotone
and submodular. Using this, we develop a greedy algorithm
that guarantees a $(1 - 1/e)$-approximation to the
optimal solution and is scalable (Section 5).

• We conduct a comprehensive set of experiments on large
real-world datasets (Section 6). We compare our proposal
against the standard approach of [10] with edge
probabilities learned from real data, and show that the
credit distribution model provides higher accuracy. We
also demonstrate the scalability of our approach by
showing our results on very large real-world networks,
on which the standard approach is not practical.

2. BACKGROUND 

Given a directed graph G = (V,E,p), where nodes are 
users and edges are labeled with influence probabilities 
among users, the influence maximization problem asks for 
a seed set of users, that maximizes the expected spread of 
influence in the social network, under a given propagation 
model. Kempe et al. [10] mainly focus on two propaga- 
tion models - the Independent Cascade (IC) and the Linear 
Threshold (LT) models. In both, at a given time, each node
can be either active or inactive. Each node's tendency to
become active increases monotonically as more of its neighbors
become active, and an active node never becomes inactive
again. Time unfolds in discrete steps.

Algorithm 1 Greedy
Input: $G$, $k$, $\sigma_m$
Output: seed set $S$
1: $S \leftarrow \emptyset$
2: while $|S| < k$ do
3:   $u \leftarrow \arg\max_{w \in V - S} (\sigma_m(S + w) - \sigma_m(S))$;
4:   $S \leftarrow S + u$

In the IC model, each active neighbor $v$ of a node $u$ has
one shot at influencing $u$ and succeeds with probability $p_{v,u}$,
the probability with which $v$ influences $u$. In the LT model,
each node $u$ is influenced by each neighbor $v$ according to
a weight $p_{v,u}$, such that the sum of incoming weights to $u$
is no more than 1. Each node $u$ chooses a threshold $\theta_u$
uniformly at random from [0, 1]. At any timestamp $t$, if the
total weight from the active neighbors of an inactive node $u$
is at least $\theta_u$, then $u$ becomes active at timestamp $t + 1$.

In both models, the process repeats until no new node
becomes active. Given a propagation model $m$ (e.g., IC or
LT) and an initial seed set $S \subseteq V$, the expected number
of active nodes at the end of the process is the expected
(influence) spread, denoted by $\sigma_m(S)$.
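As an illustration of the IC process and of how the expected spread can be estimated by repeated simulation, the following is a minimal sketch. The function names, the dictionary-based graph representation, and the default number of trials are assumptions made for this example, not part of the original work.

```python
import random

def simulate_ic(graph, probs, seeds, rng):
    """Run one IC cascade and return the set of nodes active at the end.

    graph: dict node -> list of out-neighbors (hypothetical representation)
    probs: dict (v, u) -> probability with which v influences u
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in graph.get(v, []):
                # A newly active v gets exactly one shot at each inactive u.
                if u not in active and rng.random() < probs[(v, u)]:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active

def estimate_spread(graph, probs, seeds, trials=1000, seed=0):
    """Monte Carlo estimate of the expected spread: mean cascade size."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, probs, seeds, rng)) for _ in range(trials))
    return total / trials
```

With all probabilities equal to 1 the estimate is simply the number of nodes reachable from the seeds, and with all probabilities equal to 0 it is the size of the seed set, which gives two easy sanity checks.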

The influence maximization problem is defined as follows. 

Problem 1 (Influence Maximization). Given a directed
and edge-weighted social graph $G = (V, E, p)$, a propagation
model $m$, and a number $k \le |V|$, find a set $S \subseteq V$,
$|S| = k$, such that $\sigma_m(S)$ is maximum.

Under both the IC and LT propagation models, this problem
is shown to be NP-hard [10]. Kempe et al., however,
showed that the function $\sigma_m(S)$ is monotone and
submodular. A function $f$ from sets to reals is monotone if
$f(S) \le f(T)$ whenever $S \subseteq T$. A function $f$ is submodular
if $f(S + w) - f(S) \ge f(T + w) - f(T)$ whenever $S \subseteq T$.$^1$
Submodularity intuitively says an active node's probability
of activating some inactive node $u$ does not increase if more
nodes have already attempted to activate $u$.

For any monotone submodular function $f$ with $f(\emptyset) = 0$,
the problem of finding a set $S$ of size $k$ such that $f(S)$ is
maximum can be approximated to within a factor of $(1 - 1/e)$ by
a greedy algorithm [13], a result that directly carries over to
the influence maximization problem [10] (see Algorithm 1).
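The greedy procedure of Algorithm 1 can be sketched for any set function. In this sketch, `spread` is a hypothetical callable standing in for the spread estimator; in practice it would be a Monte Carlo estimate or some other estimator, and the sketch assumes k does not exceed the number of nodes.

```python
def greedy(nodes, k, spread):
    """Algorithm 1: repeatedly add the node with the largest marginal gain.

    nodes: iterable of candidate nodes
    spread: callable mapping a set of nodes to a number (assumed monotone
            and submodular for the approximation guarantee to apply)
    """
    nodes = list(nodes)
    seeds = set()
    while len(seeds) < k:
        base = spread(seeds)
        best = max(
            (w for w in nodes if w not in seeds),
            key=lambda w: spread(seeds | {w}) - base,
        )
        seeds.add(best)
    return seeds
```

Each round evaluates `spread` once per remaining candidate, which is why the cost of a single spread evaluation dominates the running time.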

The complex step of the greedy algorithm is in line 3,
where we select the node that provides the largest marginal
gain $\sigma_m(S + w) - \sigma_m(S)$ with respect to the expected spread
of the current seed set $S$. Computing the expected spread
given a seed set is #P-hard under both the IC model [2,
8] and the LT model [4]. In their paper, Kempe et al. run
MC simulations of the propagation model sufficiently
many times (the authors report 10,000 trials) to obtain an
accurate estimate of the expected spread, resulting in a very
long computation time.

In the majority of the literature on influence maximization
following [10], the edge-weighted social graph is assumed as
input to the problem, without addressing the question of
how the probabilities are obtained. In Figure 1, we summarize
the standard process followed in influence maximization
and we make explicit the phase of learning the edge
probabilities. The process starts with the (unweighted) social
graph and a log of past action propagations that says when
each user performed an action. The log is used to estimate
influence probabilities among the nodes. This produces the
directed edge-weighted graph, which is then given as input
to the greedy algorithm, which produces the seed set using
MC simulations.

1 In the rest of the paper we write $S + w$ in place of $S \cup \{w\}$
and similarly $S - T$ in place of $S \setminus T$.

2.1 Other Related Work 

Domingos and Richardson [5] first introduced the problem 
of identifying influential users for a marketing campaign as a 
learning problem, which Kempe et al. [10] subsequently for- 
mulated as an optimization problem. Exploiting submod- 
ularity, Leskovec et al. [12] develop an efficient algorithm 
called CELF, based on a "lazy-forward" optimization in se- 
lecting new seeds. CELF is up to 700 times faster than the 
simple greedy algorithm, while delivering the same approx- 
imation guarantee (more details in Section 5.3). In spite of 
this big improvement their method still faces serious scal- 
ability issues [3], which has motivated recent works on effi- 
cient heuristics for overcoming the efficiency and scalability 
limits of the greedy algorithm [11, 3, 2, 4]. 
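The lazy-evaluation idea that CELF builds on can be sketched as follows. This is only an illustration of lazy re-evaluation of marginal gains under submodularity, not a reproduction of the CELF implementation of [12]; the function name and the heap layout are assumptions made for this example.

```python
import heapq

def lazy_greedy(nodes, k, spread):
    """Greedy seed selection with lazy re-evaluation of marginal gains."""
    seeds, current = set(), spread(set())
    # Max-heap via negated gains. Each entry stores the round (size of the
    # seed set) in which its gain was computed; the index i breaks ties.
    heap = [(-(spread({w}) - current), i, w, 0) for i, w in enumerate(nodes)]
    heapq.heapify(heap)
    while len(seeds) < k and heap:
        neg_gain, i, w, round_computed = heapq.heappop(heap)
        if round_computed == len(seeds):
            # The gain is up to date; by submodularity the stale gains
            # still in the heap are upper bounds, so none can beat it.
            seeds.add(w)
            current += -neg_gain
        else:
            # Stale entry: recompute the gain and push it back.
            gain = spread(seeds | {w}) - current
            heapq.heappush(heap, (-gain, i, w, len(seeds)))
    return seeds
```

The saving comes from the fact that, in each round, only the entries that reach the top of the heap are re-evaluated, rather than every remaining candidate.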

Chen et al. [2] propose the PMIA heuristic to estimate influence
spread under the IC model. They consider the influence
flow via Maximum Influence Paths (MIP) instead of short- 
est path [11]. An MIP between a pair of nodes (v,u) is the 
path with the maximum propagation probability from v to 
u. More recently, Chen et al. [4] propose a scalable heuristic 
called LDAG for the LT model. They construct local DAGs 
for each node and consider influence only within it. Comput- 
ing expected spread over DAGs can be done in linear time 
while over general graphs it is #P-hard [4]. While the PMIA 
and LDAG heuristics don't offer theoretical guarantees, the 
authors show empirically that these solutions are quite close 
to those obtained using the corresponding greedy algorithm. 
A key distinction with our work is that our proposal offers a 
scalable solution to influence maximization with an approx- 
imation guarantee. 

The above body of work assumes a weighted social graph 
as input and does not address how the edge probabilities 
may be obtained. Saito et al. [14] study how to learn the 
probabilities for the IC model from a set of past propa- 
gations. They formalize this as a likelihood maximization 
problem and then apply the expectation maximization (EM) 
algorithm to solve it. In our experiments, we use their 
method to learn the probabilities for the IC model. 

Goyal et al. [7] also study the problem of learning influ- 
ence probabilities. They focus on the time varying nature 
of influence, and on factors such as the influenceability of 
a specific user, and influence-proneness of a certain action. 
They also show that their methods can be used to predict 
whether a user will perform an action and at what time, with 
higher accuracy for users with higher influenceability scores. 

Our work is different from all of the above in that we pro- 
pose a method for learning a model for directly predicting 
the influence spread for a given node set, bypassing the need 
to learn edge probabilities and to run expensive MC simu- 
lations. We use this as a basis to develop a scalable approx- 
imation algorithm for influence maximization that does not 
make use of any explicit propagation model and is instead 
data-based. To the best of our knowledge, such a data-based 
approach to influence maximization is novel. 



3. WHY DATA MATTERS 

What is the relative importance of the network structure 
and the edge probabilities in determining influence propa- 
gation? How important is it to accurately learn probabil- 
ities from real propagation traces? We have seen that a 
large majority of the literature assumes edge probabilities 
to be randomly chosen from an arbitrary fixed set or to be 
determined by node degrees. How do these methods com- 
pare with that of learning edge probabilities from real data, 
in terms of the quality of seeds selected? To answer this, 
we compare the performance of Algorithm 1 under the IC 
model, with different methods of assigning edge probability. 
To this end, we present two kinds of experiments that, to 
the best of our knowledge, have never been reported before. 

Datasets. We take two real world datasets: Flixster and 
Flickr, both consisting of an unweighted directed social 
graph, along with an associated action log. An action log
is a set of triples $(u, a, t)$ which say user $u$ performed
action $a$ at time $t$. We refer to the set of triples in the action
log corresponding to a specific action a as the propagation 
trace (propagation for short) associated with a. Flixster 
(www.flixster.com) is one of the main players in the mo- 
bile and social movie rating business [9]. Here, an action 
is a user rating a movie. In other words, if user v rates 
"The King's Speech", and later on v's friend u does the 
same, we consider the action of rating "The King's Speech" 
as having propagated from v to u. Flickr is a popular 
photo sharing platform. Here, an action is a user joining an 
interest group (e.g., "Nikon Self portrait", "HDR Panora- 
mas"). The raw versions of both datasets are very large 
and as a result, experiments that require repeated MC sim- 
ulations cannot be run within any reasonable time on the 
full data set. While the large versions of the datasets are
useful for testing the scalability of our proposal, for other
experiments we have to sample smaller datasets. In what
follows, we use samples that correspond to taking a unique
"community", obtained by means of graph clustering
performed using Graclus$^2$. The resulting datasets are named
Flixster_Small and Flickr_Small (statistics in Table 1).

Table 1: Statistics of datasets.



One of the goals of the experiments is to determine which 
method more accurately predicts the expected spread of 
node sets. So we split the action log into two sets of prop- 
agation traces - training and test sets. The edge probabil- 
ities are learnt from the training set and thus, it is crucial 
that the splitting is performed in such a way that a prop- 
agation trace in its entirety falls into training or test set. 
Taking care that similar distributions of propagation sizes 
are maintained in the two sets, we place 80% and 20% of the 
propagations in training and test set respectively. Precisely, 
we sorted the propagation traces based on their size and put 
every fifth propagation in this ranking in the test set. As
a result, the number of propagations in the training set is
5.1K and 5.7K for Flixster_Small and Flickr_Small
respectively. The number of tuples in the training set is 1.5M
and 385.3K respectively. The training set is used to learn
the edge probabilities according to the EM-based method of
Saito et al. [14]. One issue in using their method is that
Saito et al. assume that the input action log
data is as though it was generated by an IC model: i.e., time
is discrete, and if user $u$ activates at time $t$, then at least one
of the neighbors of $u$ was activated at time $t - 1$. In
real-world propagations this is not the case. To close this gap
between their model and the real data, we let all previously
activated neighbors of a node be its possible influencers.

2 http://www.cs.utexas.edu/users/dml/Software/graclus.html

Figure 2: Error as a function of Actual Spread on (a) Flixster_Small, (c) Flickr_Small; (b) Scatter plot of
predicted spread vs. actual spread on Flixster_Small. The legend in all the plots follows from (a).
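The 80/20 split of propagation traces described above can be sketched as follows. The text does not say whether traces are ranked in ascending or descending order of size, so the descending order used here is an assumption, as are the function name and the return format.

```python
from collections import defaultdict

def split_action_log(action_log):
    """Split propagation traces 80/20, keeping each trace whole.

    action_log: iterable of (user, action, time) triples.
    Returns (train, test): two dicts mapping action -> list of triples.
    """
    traces = defaultdict(list)
    for user, action, time in action_log:
        traces[action].append((user, action, time))
    # Rank traces by size (descending is an assumption) so that both sets
    # see a similar distribution of propagation sizes.
    ranked = sorted(traces, key=lambda a: len(traces[a]), reverse=True)
    train, test = {}, {}
    for rank, action in enumerate(ranked, start=1):
        # Every fifth trace in the ranking goes to the test set.
        target = test if rank % 5 == 0 else train
        target[action] = traces[action]
    return train, test
```

Because the split is made per action, every propagation trace falls entirely into one of the two sets, which is the property the text requires.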

Methods experimented. In both our experiments, we 
consider the IC model together with the following methods 
of edge probability assignment based on previous work [10, 
14, 3, 2]: 

WC: the probability on an edge $(v, u)$ is 1/in-degree($u$)
(known as weighted cascade);

TV: probabilities are selected uniformly at random from
the set {0.1, 0.01, 0.001} (trivalency);

UN: all edges are uniformly assigned probability $p = 0.01$;

EM: probabilities are learned from the training set using
the EM-based method [14];

PT: finally, in order to assess how robust the greedy
method is to noise in the probability learning phase, we
take the EM-learned probabilities and add noise. More precisely,
for each edge $(v, u)$ we randomly pick a percentage from
the interval [-20%, 20%] by which to perturb $p_{v,u}$, rounding
to 0 or 1 in cases that go below 0 or over 1 respectively. We
call this method PT (EM perturbed).
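The assignment schemes WC, TV, UN and the perturbation PT can be sketched in a few lines. The function name and arguments are hypothetical, and the perturbation is read here as a relative change of the learned probability, which is an assumption about the phrase "pick a percentage". The EM method itself is not sketched.

```python
import random

def assign_probabilities(edges, method, rng=None, learned=None):
    """Return a dict (v, u) -> probability for WC, TV, UN or PT.

    edges: list of directed edges (v, u)
    learned: dict (v, u) -> learned probability, needed only for PT
    """
    rng = rng or random.Random(0)
    if method == "WC":
        in_degree = {}
        for _, u in edges:
            in_degree[u] = in_degree.get(u, 0) + 1
        return {(v, u): 1.0 / in_degree[u] for v, u in edges}
    if method == "TV":
        return {e: rng.choice([0.1, 0.01, 0.001]) for e in edges}
    if method == "UN":
        return {e: 0.01 for e in edges}
    if method == "PT":
        # Relative perturbation in [-20%, +20%] (assumption), clamped to [0, 1].
        return {
            e: min(1.0, max(0.0, learned[e] * (1 + rng.uniform(-0.2, 0.2))))
            for e in edges
        }
    raise ValueError(f"unknown method: {method}")
```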

Experiment 1: Seed set intersection. The goal of this
experiment is to understand the extent to which the choice of
edge probabilities affects the decisions of different methods
in seed set selection. We run Algorithm 1 under the IC
model$^3$ with the various methods of probability assignment
above, as well as with edge probabilities learned from the
training data set using EM. In each case, we used the
algorithm to produce a seed set of size $k = 50$. Table 2 reports
the size of the intersection for each pair of seed sets. We
can see that EM, the method using real data to learn the
influence probabilities, has a very small, almost empty,
intersection with all other methods, with the exception of its own
perturbed version PT. Thus, we conclude that all methods that
use edge probabilities based on ad hoc assumptions select
seed sets very different from the method that uses propagation
trace data to learn edge probabilities. Second, noise in
the learned edge probabilities does not affect the seed set
selection too drastically, as shown by the intersection between
EM and PT. What can we say about the quality of seed
sets chosen by the methods UN, TV, and WC, compared
to EM? This is addressed next.

3 We found the greedy algorithm on Flickr_Small is too
slow to complete in a reasonable time, even with the CELF
optimization. Hence we use the PMIA heuristic [2] (discussed
in Section 2.1) in order to speed up IC computation. Chen et
al. [2] empirically showed that PMIA produces results very
close to greedy.

Table 2: Size of seed set intersection for k = 50 on
Flixster_Small (left) and Flickr_Small (right).

Experiment 2: Spread prediction. In the second experiment,
we address the question of how good each of the
methods is at predicting the actual spread. To that end,
for a given seed set $S$, we compute the expected spread
$\sigma_{IC}(S)$ predicted by each of the methods and compare it
with the actual spread of $S$ according to ground truth. For
ground truth, for each propagation (movie in Flixster and
group in Flickr) in the test set, we take the set of users that
are the first to rate the movie (or join the group in case of
Flickr) among their friends, i.e., the set of "initiators" of the
action, to be the seed set. The actual spread is the number
of users who performed that action, also called propagation
size. This allows for a fair comparison of all methods from
a neutral standpoint, which is a first in itself.

Figures 2(a) and (c) report the root mean squared error
(RMSE) between predicted and actual spread on the
two datasets: propagations in the test set are grouped in
bins with respect to their size$^4$ and RMSE is computed
inside each bin. On Flixster_Small, the uniform method (UN)
works well but only for small propagations, while trivalency
(TV) and weighted cascade (WC) work well but only for
very large propagations (which are few, i.e., outliers).
This is explained by the fact that TV and WC always
tend to predict a very high spread, as clearly shown
in Figure 2(b), a scatter plot of predicted
and actual spread on Flixster_Small.
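The per-bin RMSE computation can be sketched as below. The function name and the choice of half-open bins starting at multiples of the bin width are assumptions made for this example.

```python
import math
from collections import defaultdict

def binned_rmse(pairs, bin_width):
    """RMSE between predicted and actual spread, per bin of actual spread.

    pairs: iterable of (actual_spread, predicted_spread), one per test trace
    Returns a dict bin_start -> RMSE, where a trace with actual spread s
    falls in the bin [bin_start, bin_start + bin_width).
    """
    squared_errors = defaultdict(list)
    for actual, predicted in pairs:
        bin_start = (actual // bin_width) * bin_width
        squared_errors[bin_start].append((predicted - actual) ** 2)
    return {
        b: math.sqrt(sum(errs) / len(errs))
        for b, errs in sorted(squared_errors.items())
    }
```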

4 In Flixster_Small bins are defined at multiples of 100,
in Flickr_Small at multiples of 20.

In the case of Flickr_Small, EM clearly outperforms all
the other methods for all sizes of actual spread (Figure 2(c)). 
In all cases, the performance of EM and PT are so close that 
they are almost indistinguishable. 

Overall, even if EM tends to underestimate the spread 
when this gets larger, it is by far the most accurate method 
with respect to the ground truth. The other methods that 
do not use the real propagation traces to learn influence 
probabilities are found to be unreliable in predicting the 
true spread. 

By putting the results of the two experiments together, we 
can draw the first conclusion of this paper: methods UN, 
TV, and WC select seed sets that are very different from 
EM and since they can be quite inaccurate in predicting 
the true spread, they can end up selecting seed sets of poor 
quality. It is thus extremely important to exploit available 
past propagation traces to learn the probabilities right. This 
finding strengthens the motivation for the rest of our work. 

4. CREDIT DISTRIBUTION MODEL 

The propagation models discussed in Section 2 are proba- 
bilistic in nature. In the IC model, coin flips decide whether 
an active node will succeed in activating its peers. In the LT 
model it is the node threshold chosen uniformly at random, 
together with the influence weights of active neighbors, that 
decides whether a node becomes active. Under both models, 
we can think of a propagation trace as a possible world, i.e., 
a possible outcome of a set of probabilistic choices. 

Given a propagation model and a directed and edge-weighted
social graph $G = (V, E, p)$, let $\mathcal{G}$ denote the set
of all possible worlds. Independently of the model $m$ chosen,
the expected spread $\sigma_m(S)$ can be written as:

$$\sigma_m(S) = \sum_{X \in \mathcal{G}} \Pr[X] \cdot \sigma_m^X(S) \qquad (1)$$

where $\sigma_m^X(S)$ is the number of nodes reachable from $S$ in the
possible world $X$. The number of possible worlds is clearly
exponential. Indeed, computing $\sigma_m(S)$ under the IC and LT
models is #P-hard [2, 4], and the standard approach (see
[10]) tackles influence spread computation from the perspective
of Eq. (1): sample a possible world $X \in \mathcal{G}$, compute
$\sigma_m^X(S)$, and repeat until the number of sampled worlds is
large enough. We now develop an alternative approach for
computing influence spread by rewriting Eq. (1), giving a
different perspective. Let $\mathit{path}(S, u)$ be an indicator random
variable that is 1 if there exists a directed path from the set
$S$ to $u$ and 0 otherwise. Moreover, let $\mathit{path}_X(S, u)$ denote the
outcome of the random variable in a possible world $X \in \mathcal{G}$.
Then we have:

$$\sigma_m^X(S) = \sum_{u \in V} \mathit{path}_X(S, u) \qquad (2)$$

Substituting in (1) and rearranging the terms we have:

$$\sigma_m(S) = \sum_{u \in V} \sum_{X \in \mathcal{G}} \Pr[X] \, \mathit{path}_X(S, u) \qquad (3)$$

From the definition of expectation, we can rewrite this to

$$\sigma_m(S) = \sum_{u \in V} E[\mathit{path}(S, u)] = \sum_{u \in V} \Pr[\mathit{path}(S, u) = 1] \qquad (4)$$

That is, the expected spread of a set $S$ is the sum, over
each node $u \in V$, of the probability of the node $u$ getting
activated given that $S$ is the initial seed set.
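The equivalence between the world-by-world view of Eq. (1) and the node-by-node view of Eq. (4) can be checked by brute force on a tiny graph. The sketch below assumes the IC model and uses the standard live-edge view of its possible worlds (each edge is independently live with its probability); the function names are hypothetical, and the enumeration is exponential in the number of edges, so it is only meant for toy inputs.

```python
from itertools import product

def reachable(live_edges, seeds):
    """Nodes reachable from seeds using only the live edges of one world."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        v = stack.pop()
        for a, b in live_edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def spread_two_ways(nodes, probs, seeds):
    """Exact IC spread via Eq. (1) and via Eq. (4), enumerating all worlds.

    probs: dict (v, u) -> edge probability
    Returns (value from Eq. (1), value from Eq. (4)).
    """
    edges = list(probs)
    by_world = 0.0                        # sum over worlds of Pr[X] * spread in X
    reach_prob = {u: 0.0 for u in nodes}  # Pr[path(S, u) = 1] for each node u
    for mask in product([False, True], repeat=len(edges)):
        pr = 1.0
        for e, live in zip(edges, mask):
            pr *= probs[e] if live else 1.0 - probs[e]
        active = reachable([e for e, live in zip(edges, mask) if live], seeds)
        by_world += pr * len(active)
        for u in active:
            reach_prob[u] += pr
    return by_world, sum(reach_prob.values())
```

Both returned values are sums of the same terms grouped differently, which is exactly the rearrangement that leads from Eq. (1) to Eq. (3).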



The standard approach samples possible worlds from the 
perspective of Eq. (1). To leverage available data on real 
propagation traces, we observe that these traces are similar 
to possible worlds, except they are "real available worlds". 
Thus, in this paper, we approach the computation of influence
spread from the perspective of Eq. (4), i.e., we estimate
$\Pr[\mathit{path}(S, u) = 1]$ directly using the propagation
traces that we have in the action log.

Data Model. We are given a social graph G = (V, E), with 
nodes V corresponding to users and directed (unweighted) 
edges E corresponding to social ties between users, and an 
action log, i.e., a relation $L$(User, Action, Time) where a
tuple $(u, a, t) \in L$ indicates that user $u$ performed action $a$ at
time t. It contains such a tuple for every action performed by 
every user of the system. We will assume that the projection 
of L on the first column is contained in the set of nodes V of 
the social graph G. We let A denote the universe of actions, 
i.e., the projection of L on the second column. Moreover, 
we assume that a user performs an action at most once, and 
define the function t(u, a) to return the time when user u 
performed action a (the value of t(u, a) is undefined if u 
never performed a, and t(u, a) < t(v,a) is false whenever 
either of t(u, a),t(v, a) is undefined). 

We say that $a$ propagates from node $u$ to $v$ iff $u$ and
$v$ are socially linked, and $u$ performs $a$ before $v$ (we also
say that $u$ influences $v$ on $a$). This defines a propagation
graph of $a$ as a directed graph $G(a) = (V(a), E(a))$, with
$V(a) = \{v \in V \mid \exists t : (v, a, t) \in L\}$ and
$E(a) = \{(u, v) \in E \mid t(u, a) < t(v, a)\}$. Note that the
propagation graph of an action $a$ is the graph representation of
the propagation trace of $a$, and it is always a DAG: it is
directed, each node can have zero or more parents, and cycles
are impossible due to the time constraint. The action log $L$ is
thus a set of these DAGs representing propagation traces through
the social graph. We denote by
$N_{in}(u, a) = \{v \mid (v, u) \in E(a)\}$ the set of potential
influencers of $u$ for action $a$, and by
$d_{in}(u, a) = |N_{in}(u, a)|$ the in-degree of $u$ for action $a$.
Finally, we call a user $u$ an initiator of action $a$ if
$u \in V(a)$ and $d_{in}(u, a) = 0$, i.e., $u$ performed action $a$
but none of its neighbors performed it before $u$ did. Table 3
summarizes the notation used.
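The construction of a propagation graph from the social graph and the action log can be sketched as follows. The function name and the container types are assumptions made for this example.

```python
def propagation_graph(social_edges, action_log, action):
    """Build G(a) = (V(a), E(a)), the sets N_in(u, a), and the initiators.

    social_edges: iterable of directed social edges (u, v)
    action_log: iterable of (user, action, time) triples, with at most one
                triple per (user, action) pair
    """
    # t[user] is the time at which the user performed the given action.
    t = {user: time for user, a, time in action_log if a == action}
    nodes = set(t)
    # Keep a social edge (u, v) only if both users performed the action
    # and u did so strictly earlier than v.
    edges = {(u, v) for u, v in social_edges if u in t and v in t and t[u] < t[v]}
    n_in = {u: set() for u in nodes}
    for v, u in edges:
        n_in[u].add(v)
    # Initiators performed the action before any of their neighbors did.
    initiators = {u for u in nodes if not n_in[u]}
    return nodes, edges, n_in, initiators
```

Since every kept edge goes from an earlier time to a strictly later one, the resulting graph cannot contain a cycle.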

The Sparsity Issue. In order to estimate $\Pr[\mathit{path}(S, u) = 1]$
using available propagation traces, it is natural to interpret
this quantity as the fraction of the actions initiated by
$S$ that propagated to $u$, given that $S$ is the seed set. More
precisely, we could estimate this probability as

$$\frac{|\{a \in A \mid \mathit{initiate}(a, S) \wedge \exists t : (u, a, t) \in L\}|}{|\{a \in A \mid \mathit{initiate}(a, S)\}|}$$

where initiate(a, S) is true iff S is precisely the set of initia- 
tors of action a. Unfortunately, this approach suffers from 
a sparsity issue which is intrinsic to the influence maxi- 
mization problem [10]. If we need to be able to estimate 
Pr[path(S,u) = 1] for any set S and node u, we will need 
an enormous number of propagation traces corresponding to 
various combinations, where each trace has as its initiator 
set precisely the required node set S. It is clearly imprac- 
tical to find a real action log where this can be realized. 
To overcome this obstacle, we propose a different approach 
to estimating Pr[path(S,u) = 1] by taking a "u-centric" 
perspective: we assign "credits" to the possible influencers 
of a node u whenever u performs an action. The model is 
formally described next. 
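To make the sparsity issue concrete, the naive estimator can be sketched as below. The function and argument names are hypothetical; the inputs are assumed to be precomputed maps from each action to its initiator set and to the set of users who performed it.

```python
def naive_path_probability(initiators_by_action, performers_by_action, seed_set, u):
    """Fraction of actions initiated exactly by seed_set that reached u.

    Returns None when no action in the log has seed_set as its exact
    initiator set, which is the sparsity issue discussed in the text.
    """
    seed_set = frozenset(seed_set)
    matching = [
        a for a, init in initiators_by_action.items()
        if frozenset(init) == seed_set
    ]
    if not matching:
        return None
    reached = sum(1 for a in matching if u in performers_by_action[a])
    return reached / len(matching)
```

For most candidate seed sets the function returns `None`, since a real log rarely contains even one trace whose initiators are exactly the queried set.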
Symbol               Meaning
$A_u$                Number of actions performed by $u$.
$N_{in}(u, a)$       Neighbors of $u$ which activated on action $a$ before $u$, i.e., its potential influencers on action $a$.
$\gamma_{v,u}(a)$    Direct influence credit given to $v$ for influencing $u$ for action $a$.
$\Gamma_{v,u}(a)$    Total credit given to $v$ for influencing $u$ for action $a$.
$\kappa_{v,u}$       Total credit given to $v$ for influencing $u$ for all actions.
                     Total credit given to $x$ for influencing $u$ for action $a$ considering the paths that are completely contained in $W \subseteq V$.
                     Average time taken by actions to propagate from user $u$ to user $v$.

Table 3: Notation adopted in the next sections. 



Credit Distribution. When a user $u$ performs an action $a$,
we want to give direct influence credit, denoted by $\gamma_{v,u}(a)$,
to all $v \in N_{in}(u, a)$, i.e., all neighbors of $u$ that have
performed the same action $a$ before $u$. We constrain the sum
of the direct credits given by a user to its neighbors to be
no more than 1. Direct credit can be assigned in various ways:
for ease of exposition, we assume for the moment that $u$
gives equal credit to each such neighbor $v$, i.e.,
$\gamma_{v,u}(a) = 1/d_{in}(u, a)$ for all $v \in N_{in}(u, a)$. Later we will
see a more sophisticated method of assigning direct credit.

Intuitively, we also want to distribute influence credit
transitively backwards in the propagation graph $G(a)$, such
that not only does $u$ give credit to the users $v \in N_{in}(u, a)$, but
they in turn pass on the credit to their predecessors in $G(a)$,
and so on. This suggests the following definition of total
credit given to a user $v$ for influencing $u$ on action $a$,
corresponding to multiple propagation paths:

$$\Gamma_{v,u}(a) = \sum_{w \in N_{in}(u,a)} \Gamma_{v,w}(a) \cdot \gamma_{w,u}(a) \qquad (5)$$

where the base of the recursion is $\Gamma_{v,v}(a) = 1$. Sometimes,
when the action is clear from the context, we can omit it
and simply write $\gamma_{v,u}$ and $\Gamma_{v,u}$. From here on, as a running
example, we consider the influence graph in Figure 1 as the
propagation graph $G(a)$ with edges labeled with direct
credits $\gamma_{v,u}(a) = 1/d_{in}(u, a)$. For instance,

$$\Gamma_{v,u} = 1 \cdot 0.25 + 0.5 \cdot 0.25 + 1 \cdot 0.25 + 0.5 \cdot 0.25 = 0.75.$$

We next define the total credit given to a set of nodes $S \subseteq V(a)$ for
influencing user u on action a as follows:

$$\Gamma_{S,u}(a) = \begin{cases} 1 & \text{if } u \in S; \\ \sum_{w \in N_{in}(u,a)} \Gamma_{S,w}(a) \cdot \gamma_{w,u}(a) & \text{otherwise.} \end{cases}$$

Consider again the propagation graph G(a) in Figure 1. Let S = {v, z}. Then
$\Gamma_{S,u}$ is the fraction of flow reaching u that flows from either v or z:

$$\Gamma_{S,u} = \Gamma_{S,w} \cdot \gamma_{w,u} + \Gamma_{S,v} \cdot \gamma_{v,u} + \Gamma_{S,t} \cdot \gamma_{t,u} + \Gamma_{S,z} \cdot \gamma_{z,u}$$
$$= 1 \cdot 0.25 + 1 \cdot 0.25 + 0.5 \cdot 0.25 + 1 \cdot 0.25 = 0.875.$$
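The set version of the recursion differs from Eq. 5 only in its base case. A minimal sketch, again on a hypothetical graph (the `PARENTS` map and the node `y` are assumptions made so that the numbers match the example above) and with equal-split direct credit:

```python
# Hypothetical propagation graph: node -> in-neighbours that acted earlier.
PARENTS = {
    "v": (), "y": (),
    "w": ("v",),
    "t": ("v", "y"),
    "z": ("t",),
    "u": ("v", "w", "t", "z"),
}

def set_credit(seeds, u):
    """Gamma_{S,u}: 1 if u is a seed, else the credit-weighted sum over parents."""
    if u in seeds:
        return 1.0
    return sum(set_credit(seeds, w) / len(PARENTS[u]) for w in PARENTS[u])

print(set_credit({"v", "z"}, "u"))  # 0.875
```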

Aggregating Over All Actions and All Nodes. The next question is how to aggregate
the influence credit over the whole action log L. Consider two nodes v and u: the
total influence credit given to v by u for all actions in A is obtained by taking
the total credit over all actions and normalizing it by the number of actions
performed by u (denoted $A_u$). This is justified by the fact that credits are
assigned by u backward to its potential influencers. We define:

$$\kappa_{v,u} = \frac{1}{A_u} \sum_{a \in A} \Gamma_{v,u}(a) \tag{6}$$

Intuitively, it denotes the average credit given to v for influencing u, over all
actions that u performs. Similarly, for a set of nodes $S \subseteq V$, we define
the total influence credit for all the actions in A as:

$$\kappa_{S,u} = \frac{1}{A_u} \sum_{a \in A} \Gamma_{S,u}(a) \tag{7}$$

Note that $\kappa_{S,u}$ corresponds, in our approach, to Pr[path(S,u) = 1] in
Eq. 4. Finally, inspired by Eq. 4, we define the influence spread $\sigma_{cd}(S)$
as the total influence credit given to S from the whole social network:

$$\sigma_{cd}(S) = \sum_{u \in V} \kappa_{S,u} \tag{8}$$

In the spirit of influence maximization (Problem 1), this is the objective function
that we want to maximize. In the next section we formally state the problem of
maximizing influence under the CD model. We prove that the problem is NP-hard and
that the function $\sigma_{cd}(\cdot)$ is submodular, paving the way for an
approximation algorithm.
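To make Eqs. 7 and 8 concrete, the sketch below evaluates the spread on a tiny, made-up action log with two actions. The propagation graphs in `graphs` are hypothetical, and direct credit is assumed to be split equally among parents.

```python
def set_credit(seeds, u, parents):
    """Gamma_{S,u}(a) on one propagation graph, with equal-split direct credit."""
    if u in seeds:
        return 1.0
    return sum(set_credit(seeds, w, parents) for w in parents[u]) / max(len(parents[u]), 1)

def sigma_cd(seeds, graphs):
    """Influence spread (Eq. 8): sum over users of the averaged set credit (Eq. 7)."""
    n_actions = {}                      # A_u: number of actions each user performed
    for parents in graphs.values():
        for u in parents:
            n_actions[u] = n_actions.get(u, 0) + 1
    spread = 0.0
    for parents in graphs.values():
        for u in parents:
            spread += set_credit(seeds, u, parents) / n_actions[u]
    return spread

# Hypothetical log: in a1, v acts and then u; in a2, u and v act and then w.
graphs = {
    "a1": {"v": (), "u": ("v",)},
    "a2": {"u": (), "v": (), "w": ("u", "v")},
}
print(sigma_cd({"v"}, graphs))  # 2.0
```

Here v receives credit 1 for itself, 0.5 for u (one of u's two actions) and 0.5 for w (half of w's single action), which adds up to 2.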

Assigning Direct Credit. We now revisit the problem of defining the direct credit
$\gamma_{v,u}(a)$ given by a node u to a neighbor v for action a. In our previous
work [7], we observed that influence decays over time in an exponential fashion and
that some users are more influenceable than others. Motivated by these ideas, we
propose to assign direct credit as:

$$\gamma_{v,u}(a) = \frac{infl(u)}{N_{in}(u,a)} \cdot \exp\left(-\frac{t(u,a) - t(v,a)}{\tau_{v,u}}\right) \tag{9}$$

Here, $\tau_{v,u}$ is the average time taken for actions to propagate from user v to
user u. The exponential term in the equation achieves the desired effect that
influence decays over time. Moreover, infl(u) denotes the user influenceability,
that is, how prone the user u is to influence by the social context [7]. Precisely,
infl(u) is defined as the fraction of actions that u performs under the influence of
at least one of its neighbors, say v, i.e., u performs the action, say a, such that
$t(u,a) - t(v,a) \le \tau_{v,u}$. The credit is normalized by $N_{in}(u,a)$, here
standing for the number of neighbors of u that performed a before u, to ensure that
the sum of direct credits assigned to neighbors of u for action a is at most 1. Note
that both infl(u) and $\tau_{v,u}$ are learnt from (the training subset of) L.

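As a small illustration of the time-decayed direct credit in Eq. 9, the sketch below evaluates it for two parents of the same user. The function name and all numeric values (influenceability, timestamps, average propagation time) are made up; in practice infl(u) and the average delays would be learnt from the training log.

```python
import math

def direct_credit(infl_u, n_parents, t_u, t_v, tau_vu):
    """gamma_{v,u}(a) = infl(u) / N_in(u,a) * exp(-(t(u,a) - t(v,a)) / tau_{v,u})."""
    return (infl_u / n_parents) * math.exp(-(t_u - t_v) / tau_vu)

# u acts at time 10 after two parents that acted at times 4 and 8 (made-up numbers).
credits = [direct_credit(0.8, 2, 10.0, t_v, 3.0) for t_v in (4.0, 8.0)]
assert credits[0] < credits[1]   # the older activation earns less credit
assert sum(credits) <= 0.8       # bounded by infl(u), hence at most 1
```

Because each exponential factor is at most 1 and infl(u) is a fraction, the direct credits handed out by u sum to at most infl(u), which is what the normalization is meant to guarantee.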
Discussion. It should be pointed out that unlike classical 
models such as IC and LT, the credit distribution model 
is not a propagation model. Instead, it is a model that, 
based on available propagation data, learns the total influ- 
ence credit accorded to a given set S by any node u and uses 
this to predict the influence spread of S. It is not suscepti- 
ble to the sparsity issue discussed above, and it obviates the 
need to perform expensive MC simulations for the purpose 
of estimating influence spread. 

5. INFLUENCE MAXIMIZATION 

We next formally define the problem studied in this paper. 

Problem 2 (Influence Maximization - CD model). Given a directed social graph
G = (V, E), an action log L, and an integer $k \le |V|$, find a set $S \subseteq V$,
|S| = k, that maximizes $\sigma_{cd}(S)$.

Theorem 1. Influence maximization under the credit distribution model is NP-hard.

Proof. We prove the hardness by reducing the well-known NP-complete problem Vertex
Cover [6] to our problem. Given an instance $\mathcal{I}$ of Vertex Cover,
consisting of an undirected graph G = (V, E) and a number k, create an instance
$\mathcal{J}$ of the influence maximization problem under CD as follows. The
directed social graph G' = (V, E') associated with $\mathcal{J}$ has the same node
set as G, and E' contains two directed edges (u,v) and (v,u) in place of every
undirected edge $(u,v) \in E$. We express the action log L associated with
$\mathcal{J}$ in terms of propagation graphs, for convenience. For each edge
$(v,u) \in E$, create two propagation graphs, corresponding to two actions $a_1$ and
$a_2$ in L, each consisting of only the two nodes v and u. In the propagation graph
$G(a_1)$, create an edge from v to u, indicating that the corresponding action
propagates from v to u. In the propagation graph $G(a_2)$, create an edge from u to
v. Assign direct credits $\gamma_{v,u}(a_1) = \gamma_{u,v}(a_2) = \alpha$, where
$\alpha \in (0,1]$. For instance, if we assign direct credits simply as
$\gamma_{v,u}(a) = 1/d_{in}(u,a)$, then $\alpha = 1$. Similarly, if we assign direct
credits as in Equation 9, then $\alpha = 1/e$. The reduction clearly takes
polynomial time. We next prove that a set $S \subseteq V$, with $|S| \le k$, is a
vertex cover of G if and only if its influence spread $\sigma_{cd}(S)$ in the
instance $\mathcal{J}$ is at least $k + \alpha \cdot (|V| - k)/2$.

Only if: Suppose S is a vertex cover of G in the instance $\mathcal{I}$. Consider an
arbitrary node u. If $u \in S$, then $\kappa_{S,u} = 1$ by definition. On the other
hand, if $u \notin S$, then $\kappa_{S,u} = \sum_a \Gamma_{S,u}(a)/(2 \cdot deg(u))$,
where deg(u) is the degree of u in G. Since u is not in the vertex cover, all its
neighbors must be in S and thus, for exactly half of the actions a that it performs,
$\Gamma_{S,u}(a) = \alpha$. These are the actions that it performs after its
neighbor. Hence, $\sum_a \Gamma_{S,u}(a) = \alpha \cdot deg(u)$ and
$\kappa_{S,u} = \alpha/2$. This implies
$\sigma_{cd}(S) = \sum_{u \in V} \kappa_{S,u} = k + \alpha \cdot (|V| - k)/2$.

If: Let S be any seed set whose spread is at least $k + \alpha \cdot (|V| - k)/2$ in
instance $\mathcal{J}$. Let N(u) be the set of neighbors of u in G, that is,
$N(u) = \{v \in V \mid (v,u) \in E\}$. Consider an arbitrary node $u \notin S$. Each
node v in $N(u) \cap S$ has a credit of $\alpha$ over u for the unique action whose
propagation graph is the edge (v,u), and a null credit for all the other actions.
Therefore $\kappa_{v,u} = \alpha/(2 \cdot deg(u))$. Aggregating over the whole seed
set S, we have $\kappa_{S,u} = \alpha \cdot |N(u) \cap S|/(2 \cdot deg(u))$. Hence,
$\sigma_{cd}(S) = \sum_{u \in V} \kappa_{S,u} = k + \sum_{u \in V \setminus S} \alpha \cdot |N(u) \cap S|/(2 \cdot deg(u))$.

From our assumption, $\sigma_{cd}(S) \ge k + \alpha \cdot (|V| - k)/2$, it follows
that $\sum_{u \in V \setminus S} |N(u) \cap S|/deg(u) \ge |V| - k$. As
$|N(u) \cap S| \le deg(u)$, this is possible only when
$\forall u \in V \setminus S: |N(u) \cap S| = deg(u)$, implying that all neighbors
of u must be in S and therefore S is a vertex cover in instance $\mathcal{I}$. □
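The reduction can be checked by hand on a very small graph. The sketch below builds the two-node propagation graphs described in the proof for a hypothetical path a - b - c and compares the spread of a vertex cover and of a non-cover against the threshold. The function name and the example graph are assumptions for illustration.

```python
def spread_in_reduction(edges, seeds, alpha=1.0):
    """sigma_cd(seeds) in the instance built from an undirected graph.

    Every undirected edge {v, u} yields two actions: one performed first by v
    and then by u, and one performed first by u and then by v.
    """
    actions = list(edges) + [(u, v) for (v, u) in edges]   # (first, second)
    n_actions = {}
    for pair in actions:
        for node in pair:
            n_actions[node] = n_actions.get(node, 0) + 1
    spread = 0.0
    for first, second in actions:
        credit_first = 1.0 if first in seeds else 0.0
        # A non-seed follower passes on alpha times the credit of its only parent.
        credit_second = 1.0 if second in seeds else alpha * credit_first
        spread += credit_first / n_actions[first] + credit_second / n_actions[second]
    return spread

edges = [("a", "b"), ("b", "c")]          # path a - b - c, so |V| = 3
k, alpha = 1, 1.0
threshold = k + alpha * (3 - k) / 2       # 2.0
print(spread_in_reduction(edges, {"b"}, alpha) >= threshold)  # True: {b} is a vertex cover
print(spread_in_reduction(edges, {"a"}, alpha) >= threshold)  # False: edge (b, c) is uncovered
```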

Since the problem is NP-hard, we are interested in devel- 
oping an approximation algorithm. We prove that the in- 
fluence spread function is submodular under the CD model, 
paving the way for efficient approximation. 

Theorem 2. $\sigma_{cd}(S)$ is monotone and submodular.

Proof. It suffices to show that $\Gamma_{S,u}(a)$ is monotone and submodular, as a
positive linear combination of monotone, submodular functions is also monotone and
submodular [13]. Clearly it is monotone. We prove submodularity by induction on path
lengths. Note that the propagation graph for a given action is acyclic and hence the
maximum path length is |V| - 1. Let $\Gamma_{S,u}(a, \ell)$ denote the total credit
obtained by the set S for influencing u, restricting attention to paths of length
$\le \ell$. Thus, $\Gamma_{S,u}(a) = \Gamma_{S,u}(a, |V| - 1)$.

Let S and T be two node sets such that $S \subseteq T$ and let $x \notin T$. Recall
that the function $\Gamma$ is submodular iff
$\Gamma_{S+x,u}(a) - \Gamma_{S,u}(a) \ge \Gamma_{T+x,u}(a) - \Gamma_{T,u}(a)$. We
call the left hand side of the inequality the marginal gain of x with respect to S
(implicitly understood to be on u) and similarly for the right hand side.

Base Case: In the base case, $\ell = 0$. Depending on u, the base case can be split
into various sub-cases: (a) If $u \in S$, then the marginal gain of x with respect
to both S and T is 0; (b) If $u \in T$, $u \notin S$, then while the marginal gain
of x with respect to T is 0, its marginal gain with respect to S is no less than 0,
as the function $\Gamma(\cdot)$ is monotone; (c) If u = x, then the marginal gain of
x with respect to both S and T is exactly 1; (d) If $u \ne x$ and $u \notin T$, then
the total credits on u from S, S + x, T and T + x are 0. This proves the base case.

Induction Step: Assume that the function $\Gamma$ is submodular when restricted to
path lengths $\le \ell$, that is, $\forall w \in V$:

$$\Gamma_{S+x,w}(a,\ell) - \Gamma_{S,w}(a,\ell) \ge \Gamma_{T+x,w}(a,\ell) - \Gamma_{T,w}(a,\ell) \tag{10}$$

We will prove that the function remains submodular for paths of length $\ell + 1$
for any node $u \in V$. Consider the marginal gain of x with respect to S on u when
restricted to paths of length $\le \ell + 1$, that is, consider
$\Gamma_{S+x,u}(a,\ell+1) - \Gamma_{S,u}(a,\ell+1)$. By definition, this is equal to
$\sum_{w \in N_{in}(u,a)} \Gamma_{S+x,w}(a,\ell) \cdot \gamma_{w,u}(a) - \sum_{w \in N_{in}(u,a)} \Gamma_{S,w}(a,\ell) \cdot \gamma_{w,u}(a)$.
Taking the common factor $\gamma_{w,u}(a)$ out and applying the induction hypothesis
(Eq. 10), this is

$$\ge \sum_{w \in N_{in}(u,a)} \left(\Gamma_{T+x,w}(a,\ell) - \Gamma_{T,w}(a,\ell)\right) \cdot \gamma_{w,u}(a) = \Gamma_{T+x,u}(a,\ell+1) - \Gamma_{T,u}(a,\ell+1)$$

This was to be shown. □
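Theorem 2 can also be sanity-checked by brute force on a small instance. The sketch below enumerates every pair of nested sets and every extra node on a hypothetical six-node propagation graph (the `PARENTS` map is an assumption, with equal-split direct credit) and checks both properties for a fixed target node. It is a numerical check on one example, not a substitute for the proof.

```python
from itertools import combinations

# Hypothetical propagation graph: node -> in-neighbours that acted earlier.
PARENTS = {
    "v": (), "y": (),
    "w": ("v",),
    "t": ("v", "y"),
    "z": ("t",),
    "u": ("v", "w", "t", "z"),
}

def set_credit(seeds, u):
    """Gamma_{S,u} with equal-split direct credit."""
    if u in seeds:
        return 1.0
    return sum(set_credit(seeds, w) / len(PARENTS[u]) for w in PARENTS[u])

def subsets(items):
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone_submodular(target, eps=1e-12):
    """Check Gamma_{S,target} over all S within T and all x outside T."""
    for T in subsets(PARENTS):
        for S in subsets(T):
            if set_credit(S, target) > set_credit(T, target) + eps:
                return False                      # monotonicity violated
            for x in PARENTS:
                if x in T:
                    continue
                gain_S = set_credit(S | {x}, target) - set_credit(S, target)
                gain_T = set_credit(T | {x}, target) - set_credit(T, target)
                if gain_S + eps < gain_T:
                    return False                  # diminishing returns violated
    return True

print(is_monotone_submodular("u"))  # True
```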

In the remaining sub-sections, we develop an efficient ap- 
proximation algorithm to solve the influence maximization 
problem under the CD model. 

5.1 Overview of our method 

In the previous section, we showed that while influence maximization under the CD
model is NP-hard, the influence spread function is monotone and submodular.
Consequently, the greedy algorithm (Algorithm 1) provides a (1 - 1/e)-approximation
to the optimum solution [13]. However, the greedy algorithm by itself does not
guarantee efficiency, as it requires computing the marginal gain of a candidate seed
node w with respect to the current seed set, i.e.,
$\sigma_{cd}(S + w) - \sigma_{cd}(S)$ (line 3 of Algorithm 1). For the IC and LT
models, this is done by expensive MC simulations. For the CD model, the marginal
gain can be computed directly from the action log L. A naive way to do this would be
to scan L in each iteration, but this approach would be very inefficient. Hence, we
focus our attention on computing the marginal gain efficiently by carefully
exploiting some properties of our model.

From here on, with a superscript $W \subseteq V$ on the function $\Gamma(\cdot)$, we
denote the function evaluated on the sub-graph induced by the nodes in W. For
example, $\Gamma^{W}_{x,u}(a)$ is the total credit given to node x for influencing
node u to perform action a, considering the paths that are contained completely in
the sub-graph induced by $V(a) \cap W$, that is, the sub-graph of the propagation
graph for action a induced by the nodes $W \cap V(a)$. When the superscript is not
present, the graph considered is the whole propagation graph for action a, i.e.,
$\Gamma_{x,u}(a) = \Gamma^{V(a)}_{x,u}(a)$. It should be noted that the direct
credit $\gamma_{x,u}$ is always assigned considering the whole propagation graph.
The following result is key to the efficiency of our algorithm.



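Theorem 3.

$$\sigma_{cd}(S + x) - \sigma_{cd}(S) = \sum_{a \in A} \left( (1 - \Gamma_{S,x}(a)) \cdot \sum_{u \in V} \frac{1}{A_u} \cdot \Gamma^{V-S}_{x,u}(a) \right)$$
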

Intuitively, the theorem says that the marginal gain of a node x equals the sum of
the normalized marginal gains of x on all actions. We give more insights into this
equation in the proof. The theorem provides us an efficient method to compute the
marginal gain, given values of $\Gamma_{S,x}(a)$ and $\Gamma^{V-S}_{x,u}(a)$: this
is the key idea behind the efficiency of our algorithm, which can be abstractly
summarized as follows:

1. Initially, scan the action log L and compute $\Gamma_{v,u}(a)$ for all
combinations of v, u and a (Algorithm 2). Note that at the beginning,
$S = \emptyset$ and hence $\Gamma_{S,x}(a) = 0$ for all combinations of x and a.

2. In each iteration of the greedy method, a node that provides the maximum marginal
gain is added to the seed set. For this step we adopt the CELF [12] optimization
idea (Algorithm 3).

3. To compute the marginal gain of a node x efficiently, we use Theorem 3. It
requires values of $\Gamma^{V-S}_{x,u}(a)$ and $\Gamma_{S,x}(a)$ (Algorithm 4).

4. Once a node is added to the seed set, $\Gamma^{V-S}_{x,u}(a)$ and
$\Gamma_{S,x}(a)$ are updated using Lemmas 2 and 3 (Alg. 5).

5.2 Proof of Theorem 3 

The proof of Theorem 3 is non-trivial and we need to prove 
a few auxiliary claims first. In the process, we also derive 
equations to update the total credit (step 4 in the outline of 
our algorithm above). 



Lemma 1. $\Gamma_{S,u}(a) = \sum_{v \in S} \Gamma^{V-S+v}_{v,u}(a)$.

We first explain the claim by means of an example, taking the influence graph in
Figure 1 as a propagation graph G(a) with direct credit
$\gamma_{v,u}(a) = 1/d_{in}(u,a)$. Let S = {v, z}. According to Lemma 1 (dropping
the argument a),
$\Gamma_{S,u} = \Gamma^{V-S+v}_{v,u} + \Gamma^{V-S+z}_{z,u} = (0.25 + 0.25 + 0.5 \cdot 0.25) + 0.25 = 0.875$.
Note that the credit given to v via the path v → t → z → u is ignored. Next, we
formally prove the claim.
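The decomposition in Lemma 1 can be reproduced numerically. In the sketch below, `credit_within` restricts paths to an allowed node set while still taking direct credit from the whole graph. The `PARENTS` map is a hypothetical graph chosen to match the numbers of the example, and the function names are illustrative.

```python
# Hypothetical propagation graph: node -> in-neighbours that acted earlier.
PARENTS = {
    "v": (), "y": (),
    "w": ("v",),
    "t": ("v", "y"),
    "z": ("t",),
    "u": ("v", "w", "t", "z"),
}

def credit_within(v, u, allowed):
    """Credit of v over u along paths whose nodes all lie in `allowed`."""
    if v not in allowed or u not in allowed:
        return 0.0
    if v == u:
        return 1.0
    # Direct credit 1 / d_in(u) is always taken from the whole graph.
    return sum(credit_within(v, w, allowed) / len(PARENTS[u]) for w in PARENTS[u])

def set_credit(seeds, u):
    """Gamma_{S,u} computed directly from its recursive definition."""
    if u in seeds:
        return 1.0
    return sum(set_credit(seeds, w) / len(PARENTS[u]) for w in PARENTS[u])

def lemma1_rhs(seeds, u):
    """Sum over seeds v of the credit of v on the sub-graph induced by V - S + v."""
    nodes = set(PARENTS)
    return sum(credit_within(v, u, (nodes - seeds) | {v}) for v in seeds)

seeds = {"v", "z"}
print(set_credit(seeds, "u"), lemma1_rhs(seeds, "u"))  # 0.875 0.875
```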
Proof of Lemma 1: By induction on path lengths. Recall that $\Gamma_{S,u}(a,\ell)$
denotes the total credit given to set S for influencing node u over paths of length
no more than $\ell$.

Base Case: $\ell = 0$ implies only u can get credit for influencing itself.
Therefore if $u \notin S$, both sides of the equality become 0. When $u \in S$, the
total credit given to S for influencing u (left hand side) is 1 by definition, while
in the right hand side all terms in the summation are 0 except the case v = u, that
is, $\Gamma^{V-S+v}_{v,u}(a,0)$ with v = u, which is 1.

Induction Step: Assume that the lemma is true for path lengths no more than $\ell$.
We prove it for path lengths up to $\ell + 1$. We start with the definition of
$\Gamma_{S,u}(a,\ell+1)$ (Eq. 5):
$\Gamma_{S,u}(a,\ell+1) = \sum_{w \in N_{in}(u,a)} \Gamma_{S,w}(a,\ell) \cdot \gamma_{w,u}(a)$.
Applying the induction hypothesis, the right hand side becomes:

$$\sum_{w \in N_{in}(u,a)} \left( \sum_{v \in S} \Gamma^{V-S+v}_{v,w}(a,\ell) \right) \cdot \gamma_{w,u}(a) = \sum_{v \in S} \left( \sum_{w \in N_{in}(u,a)} \Gamma^{V-S+v}_{v,w}(a,\ell) \cdot \gamma_{w,u}(a) \right)$$

This concludes the proof. □



Next, we show how the total credit can be updated incrementally when the induced
sub-graph under consideration changes. Consider the sub-graph induced by the nodes
W = V - S, where S is the current seed set. Let $\Gamma^{W}_{v,u}(a)$ be the total
credit given to node v for influencing u in this sub-graph. Suppose node x is added
to the seed set; then we are interested in computing $\Gamma^{W-x}_{v,u}(a)$. This
is clearly the total credit given to v minus the total credit given to v via paths
that go through x. More precisely, we have:

Lemma 2. $\Gamma^{W-x}_{v,u}(a) = \Gamma^{W}_{v,u}(a) - \Gamma^{W}_{v,x}(a) \cdot \Gamma^{W}_{x,u}(a)$.

As an example, consider again the propagation graph in Figure 1 with S = {t, z}. The
total credit given to v for influencing u on the subgraph induced by nodes in
V - {t, z} is $1 \cdot 0.25 + 0.25 = 0.5$. Suppose w is added to the seed set S;
then $\Gamma^{V-S-w}_{v,u} = 0.5 - 1 \cdot 0.25 = 0.25$.
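The update rule of Lemma 2 states that removing a node x from the allowed set W lowers the credit of v over u by the product of the credit of v over x and the credit of x over u, both measured on W. The sketch below replays the example on a hypothetical graph (the `PARENTS` map is an assumption consistent with the quoted numbers) by computing both sides independently.

```python
# Hypothetical propagation graph: node -> in-neighbours that acted earlier.
PARENTS = {
    "v": (), "y": (),
    "w": ("v",),
    "t": ("v", "y"),
    "z": ("t",),
    "u": ("v", "w", "t", "z"),
}

def credit_within(v, u, allowed):
    """Credit of v over u along paths whose nodes all lie in `allowed`."""
    if v not in allowed or u not in allowed:
        return 0.0
    if v == u:
        return 1.0
    return sum(credit_within(v, w, allowed) / len(PARENTS[u]) for w in PARENTS[u])

W = set(PARENTS) - {"t", "z"}                     # V - S with S = {t, z}
before = credit_within("v", "u", W)               # 0.5
through_w = credit_within("v", "w", W) * credit_within("w", "u", W)
recomputed = credit_within("v", "u", W - {"w"})   # evaluated from scratch

print(before - through_w, recomputed)  # 0.25 0.25
```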



The next lemma shows how to incrementally update the total credit of influence given
to a set S by a node u, after x is added to the set. This is needed in step 4 of our
method as sketched in the previous section.

Lemma 3. $\Gamma_{S+x,u}(a) = \Gamma_{S,u}(a) + \Gamma^{V-S}_{x,u}(a) \cdot (1 - \Gamma_{S,x}(a))$.



Proof: Since we refer to a single action, we drop the argument a and assume it
implicitly. We use Lemma 1 to expand $\Gamma_{S+x,u}$ and $\Gamma_{S,u}$:

$$\Gamma_{S+x,u} - \Gamma_{S,u} = \sum_{v \in S+x} \Gamma^{V-S-x+v}_{v,u} - \sum_{v \in S} \Gamma^{V-S+v}_{v,u} = \Gamma^{V-S}_{x,u} + \sum_{v \in S} \left( \Gamma^{V-S-x+v}_{v,u} - \Gamma^{V-S+v}_{v,u} \right)$$

Applying Lemma 2 to the terms inside the summation (with W = V - S + v), the right
hand side becomes

$$\Gamma^{V-S}_{x,u} - \sum_{v \in S} \Gamma^{V-S+v}_{v,x} \cdot \Gamma^{V-S+v}_{x,u}$$

The terms inside the summation denote the total credit given to v for influencing u
considering the paths that go through x in the sub-graph induced by the nodes
V - S + v. Since the graph is acyclic, if $\Gamma^{V-S+v}_{v,x}$ is non-zero, then
any path from x to u cannot pass through v. Hence,
$\Gamma^{V-S+v}_{v,x} \cdot \Gamma^{V-S+v}_{x,u} = \Gamma^{V-S+v}_{v,x} \cdot \Gamma^{V-S}_{x,u}$.
Note that this equality holds even when $\Gamma^{V-S+v}_{v,x} = 0$, as both sides
would be 0 in that case. Thus,

$$\Gamma_{S+x,u} - \Gamma_{S,u} = \Gamma^{V-S}_{x,u} - \sum_{v \in S} \Gamma^{V-S+v}_{v,x} \cdot \Gamma^{V-S}_{x,u}$$

The term $\Gamma^{V-S}_{x,u}$ can be taken out of the summation and applying Lemma 1
gives $\sum_{v \in S} \Gamma^{V-S+v}_{v,x} = \Gamma_{S,x}$, from which the lemma
follows. □

Finally, we are ready to prove Theorem 3.

Proof of Theorem 3: By definition,

$$\sigma_{cd}(S + x) - \sigma_{cd}(S) = \sum_{u \in V} \frac{1}{A_u} \sum_{a \in A} \left( \Gamma_{S+x,u}(a) - \Gamma_{S,u}(a) \right)$$

Applying Lemma 3, the right hand side becomes

$$\sum_{u \in V} \frac{1}{A_u} \sum_{a \in A} \Gamma^{V-S}_{x,u}(a) \cdot (1 - \Gamma_{S,x}(a))$$

Rearranging the terms, we get
$\sigma_{cd}(S + x) - \sigma_{cd}(S) = \sum_{a \in A} \left( (1 - \Gamma_{S,x}(a)) \cdot \sum_{u \in V} \frac{1}{A_u} \cdot \Gamma^{V-S}_{x,u}(a) \right)$,
which was to be shown. □
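As a numerical illustration of Theorem 3, the sketch below compares the marginal gain obtained by recomputing the spread from scratch with the closed form of the theorem. It assumes a single action performed once by every user (so every $A_u$ is 1), equal-split direct credit, and a hypothetical propagation graph; all names are illustrative.

```python
# Hypothetical propagation graph: node -> in-neighbours that acted earlier.
PARENTS = {
    "v": (), "y": (),
    "w": ("v",),
    "t": ("v", "y"),
    "z": ("t",),
    "u": ("v", "w", "t", "z"),
}

def credit_within(v, u, allowed):
    """Credit of v over u along paths whose nodes all lie in `allowed`."""
    if v not in allowed or u not in allowed:
        return 0.0
    if v == u:
        return 1.0
    return sum(credit_within(v, w, allowed) / len(PARENTS[u]) for w in PARENTS[u])

def set_credit(seeds, u):
    if u in seeds:
        return 1.0
    return sum(set_credit(seeds, w) / len(PARENTS[u]) for w in PARENTS[u])

def sigma(seeds):
    """Spread for a single action in which every user acts once (A_u = 1)."""
    return sum(set_credit(seeds, u) for u in PARENTS)

def marginal_gain(seeds, x):
    """Right hand side of Theorem 3 under the same single-action assumption."""
    rest = set(PARENTS) - set(seeds)                  # V - S
    reach = sum(credit_within(x, u, rest) for u in PARENTS)
    return (1.0 - set_credit(seeds, x)) * reach

S = {"v"}
print(sigma(S | {"t"}) - sigma(S), marginal_gain(S, "t"))  # 1.25 1.25
```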

5.3 Algorithms 

In this section, we present our algorithm, which builds on the properties developed
in the previous sections and whose outline was given in Section 5.1. Initially, we
scan the action log L; then we use the greedy algorithm with the CELF optimization
to select the seed set. While scanning the action log, we maintain all the
information needed to select k seeds later. In particular, we compute the total
credit given to each node v for influencing any other node u for all actions a and
record it in the data structure UC (User Credits). Each entry UC[v][u][a]
corresponds to $\Gamma^{V-S}_{v,u}(a)$, that is, the total credit given to v for
activating u on the graph induced by V - S, where S is the current seed set. We also
maintain another data structure SC (Set Credits), where each entry SC[x][a] refers
to the total credit given to the current seed set S by a node x for an action a,
that is, $\Gamma_{S,x}(a)$. Since S is empty in the beginning, SC is not used in the
first iteration.

Algorithm 2 Scan
Input: G, L, λ
Output: UC

 1: UC ← ∅
 2: for each action a in L do
 3:     current_table ← ∅
 4:     for each tuple (u, a, t_u) in chronological order do
 5:         Parents(u) ← ∅; A_u ← A_u + 1; UC[*][u][a] ← 0
 6:         while ∃v : (v,u) ∈ G, v ∈ current_table do
 7:             Parents(u) ← Parents(u) ∪ {v}
 8:         for each v ∈ Parents(u) do
 9:             compute γ_{v,u}
10:             if γ_{v,u} ≥ λ then
11:                 UC[v][u][a] ← UC[v][u][a] + γ_{v,u}
12:             for each w such that UC[w][v][a] · γ_{v,u} ≥ λ do
13:                 UC[w][u][a] ← UC[w][u][a] + γ_{v,u} · UC[w][v][a]
14:         current_table ← current_table ∪ {u}



Algorithm 2 describes the first step of our method, which scans L. L is maintained
sorted, first by action and then by time, so the scan processes one action at a
time, in chronological order. We use current_table to maintain the list of users who
have performed the current action and have been seen so far, $A_u$ to denote the
number of actions performed by user u in L, and Parents(u) for the list of parents
of user u with respect to the current action a, that is, $N_{in}(u,a)$. For each
action a and for each user u that performs it, we scan current_table to find the
neighbors of u that have already performed a and add them to the list of parents of
u. Then, for each parent v of u, we compute the direct credit $\gamma_{v,u}(a)$
appropriately (line 9). For ease of exposition, here we assume the simple definition
$\gamma_{v,u}(a) = 1/N_{in}(u,a)$, which can be implemented as
$\gamma = 1/|Parents(u)|$. If we want to use the more complex definition of direct
credit given in Eq. (9), we need to learn the parameters $\tau_{v,u}$ for all edges
and infl(u) for all nodes in advance and pass them to Algorithm 2 as input,
similarly to what the standard method does for influence probabilities. Although it
is straightforward to learn these parameters by means of a preliminary scan of L, we
refer the reader to [7] for an efficient way to learn them. The total credit given
to various nodes for influencing u is then computed using Equation 5 (lines 10-13).
To reduce memory requirements, we use a truncation threshold λ and discard credits
that are below the threshold. In the experiments we will assess the effect of this
truncation.
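The scan can be sketched in a few lines of Python. This is a simplified illustration rather than the released implementation: it assumes equal-split direct credit, stores UC as a flat dictionary keyed by (v, u, a), and uses a helper map `credited_by` (a name introduced here) to enumerate the users holding credit over a given user. Users with identical timestamps are simply processed in sorted order.

```python
from collections import defaultdict

def scan(edges, log, lam=0.0):
    """One pass over the log; returns UC[(v, u, a)] and the action counts A_u.

    edges: set of directed social edges (v, u); log: list of (user, action, time).
    """
    UC = defaultdict(float)
    credited_by = defaultdict(set)      # (u, a) -> users with credit over u
    n_actions = defaultdict(int)
    current_action, performed = None, []
    for user, action, _time in sorted(log, key=lambda r: (r[1], r[2])):
        if action != current_action:    # a new propagation trace starts
            current_action, performed = action, []
        n_actions[user] += 1
        parents = [v for v in performed if (v, user) in edges]
        for v in parents:
            gamma = 1.0 / len(parents)  # equal-split direct credit
            if gamma >= lam:
                UC[v, user, action] += gamma
                credited_by[user, action].add(v)
            for w in list(credited_by[v, action]):
                transitive = gamma * UC[w, v, action]
                if transitive >= lam:   # truncation threshold
                    UC[w, user, action] += transitive
                    credited_by[user, action].add(w)
        performed.append(user)
    return UC, n_actions

# Hypothetical social edges and a single made-up propagation trace.
edges = {("v", "w"), ("v", "t"), ("y", "t"), ("t", "z"),
         ("v", "u"), ("w", "u"), ("t", "u"), ("z", "u")}
log = [("v", "a", 1), ("y", "a", 2), ("w", "a", 3),
       ("t", "a", 4), ("z", "a", 5), ("u", "a", 6)]
UC, n_actions = scan(edges, log)
print(UC["v", "u", "a"])  # 0.75
```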

Algorithm 3 Greedy with CELF
Input: UC, k
Output: seed set S

1: SC ← ∅; S ← ∅; Q ← ∅
2: for each x ∈ V do
3:     x.mg ← computeMG(x, UC, SC); x.it ← 0; add x to Q
4: while |S| < k do
5:     x ← pop(Q)
6:     if x.it = |S| then S ← S ∪ {x}; update(x, UC, SC)
7:     else
8:         x.mg ← computeMG(x, UC, SC);
9:         x.it ← |S|; Reinsert x into Q and heapify



Once the first phase is completed, we use the standard greedy algorithm with the
CELF optimization [12] to select the seeds (Algorithm 3). The algorithm maintains a
queue Q where an entry for a user x is stored in the form ⟨x, mg, it⟩, where mg
represents the marginal gain of user x with respect to the seed set in iteration it.
Q is always kept sorted in decreasing order of mg. Initially, the influence spread
of each node is computed and Q is built (lines 2-3). In each iteration, the top
element x of Q is analyzed. If x has already been analyzed in the current iteration
(that is, x.it = |S|), then it is picked as the next seed node and the subroutine
update is called (line 6). On the other hand, if x.it < |S|, we recompute the
marginal gain of x with respect to S by calling the subroutine computeMG (line 8).
Then x.it is set appropriately and x is re-inserted into Q (line 9).
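The lazy-evaluation loop described above can be sketched generically with Python's `heapq`. The marginal-gain callback stands in for computeMG, and a toy set-coverage function is used here only because it is submodular and easy to trace by hand. Both the callback signature and the `COVER` data are assumptions of this sketch.

```python
import heapq

def celf(candidates, k, marginal_gain):
    """Lazy greedy selection: re-evaluate a candidate only when its gain is stale."""
    chosen = []
    # Entries are (-gain, tie_breaker, candidate, iteration_when_computed).
    heap = [(-marginal_gain(x, chosen), i, x, 0) for i, x in enumerate(candidates)]
    heapq.heapify(heap)
    while len(chosen) < k and heap:
        _neg_gain, i, x, it = heapq.heappop(heap)
        if it == len(chosen):
            chosen.append(x)            # gain is fresh for the current seed set
        else:
            heapq.heappush(heap, (-marginal_gain(x, chosen), i, x, len(chosen)))
    return chosen

# Toy submodular objective (set coverage), used purely for illustration.
COVER = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}

def coverage_gain(x, chosen):
    covered = set().union(*(COVER[s] for s in chosen))
    return len(COVER[x] - covered)

print(celf(["a", "b", "c"], 2, coverage_gain))  # ['c', 'a']
```

Submodularity is what makes the laziness safe: a stale gain is an upper bound on the current one, so a candidate whose refreshed gain still tops the queue cannot be beaten by any other.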



Algorithm 4 computeMG
Input: x, UC, SC
Output: mg

1: mg = 0
2: for each action a such that ∃u : UC[x][u][a] > 0 do
3:     mg_a ← 1/A_x
4:     for each user u such that UC[x][u][a] > 0 do
5:         mg_a ← mg_a + UC[x][u][a]/A_u
6:     mg ← mg + mg_a · (1 - SC[x][a])

Algorithm 5 update
Input: x, UC, SC

for each action a such that ∃u : UC[x][u][a] > 0 do
    for each u such that UC[x][u][a] > 0 do
        for each v such that UC[v][x][a] > 0 do
            UC[v][u][a] ← UC[v][u][a] - UC[v][x][a] · UC[x][u][a]
        SC[u][a] ← SC[u][a] + UC[x][u][a] · (1 - SC[x][a])
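Algorithms 4 and 5 can be sketched in Python as follows. This is an illustrative rendering under assumptions: UC is stored as a dictionary from action to a dictionary keyed by (v, u), SC is keyed by (user, action), and the credits in the usage example come from a hypothetical single-action propagation graph with equal-split direct credit.

```python
def compute_mg(x, UC, SC, n_actions):
    """Marginal gain of x via Theorem 3 (mirrors Algorithm 4)."""
    mg = 0.0
    for a, credits in UC.items():
        reached = {u: c for (v, u), c in credits.items() if v == x and c > 0}
        if not reached:
            continue  # as in line 2 of Algorithm 4: x influenced nobody on a
        mg_a = 1.0 / n_actions[x] + sum(c / n_actions[u] for u, c in reached.items())
        mg += mg_a * (1.0 - SC.get((x, a), 0.0))
    return mg

def update(x, UC, SC):
    """Fold x into the seed set (mirrors Algorithm 5)."""
    for a, credits in UC.items():
        reached = {u: c for (v, u), c in credits.items() if v == x and c > 0}
        sources = {v: c for (v, u), c in credits.items() if u == x and c > 0}
        for u, c_xu in reached.items():
            for v, c_vx in sources.items():
                # Lemma 2: drop the credit of v over u that flowed through x.
                credits[v, u] = credits.get((v, u), 0.0) - c_vx * c_xu
            # Lemma 3: credit of the enlarged seed set over u.
            SC[u, a] = SC.get((u, a), 0.0) + c_xu * (1.0 - SC.get((x, a), 0.0))

# Total credits of a hypothetical single-action graph; every user acts once.
UC = {"a": {("v", "w"): 1.0, ("v", "t"): 0.5, ("y", "t"): 0.5, ("t", "z"): 1.0,
            ("v", "z"): 0.5, ("y", "z"): 0.5, ("v", "u"): 0.75, ("y", "u"): 0.25,
            ("w", "u"): 0.25, ("t", "u"): 0.5, ("z", "u"): 0.25}}
n_actions = dict.fromkeys("vywtzu", 1)
SC = {}
print(compute_mg("v", UC, SC, n_actions))  # 3.75
update("v", UC, SC)
print(compute_mg("t", UC, SC, n_actions))  # 1.25
```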



Algorithm 4 computes the marginal gain of a node x with respect to the current seed
set S. It leverages Theorem 3 to do this efficiently. For an action a, lines 3-5
compute the term $\sum_{u \in V} \frac{1}{A_u} \cdot \Gamma^{V-S}_{x,u}(a)$ and line
6 multiplies it by the term $(1 - \Gamma_{S,x}(a))$. Finally, whenever a user x is
added to the seed set, the subroutine update is invoked and Algorithm 5 updates both
UC and SC, using Lemmas 2 and 3.

Memory requirements: Our algorithm requires maintaining the data structure UC, whose
size is potentially of the order of $\sum_{a \in A} |V(a)|^2$. In reality, total
credit decreases sharply with the length of paths. Thus, by ignoring values that are
below a given truncation threshold λ, the memory usage of our algorithm can be kept
reasonable. We study the effect of λ in the experiments in the next section.

6. EXPERIMENTAL EVALUATION

The goals of our experiments are manifold. At a high 
level, we want to evaluate the different models and the op- 
timization algorithms based on them with respect to accu- 
racy of spread prediction, quality of seed selection, running 
time, and scalability. We perform additional experiments 
on the CD model and on the influence maximization algo- 
rithm based on it, to explore the impact of training data size 
on the quality of the solution and the impact of truncation 
threshold on the quality, running time, and memory usage. 
The source of the code used in our experiments is available
at http://people.cs.ubc.ca/~goyal/code-release.php.

We experiment on the same two real world datasets of 
Section 3 (Table 1). While we use the "large" versions of the 
datasets only to study the scalability of our method, "small" 
versions of the datasets are used to compare our algorithm 
with other methods (that do not scale to the large versions). 

Methods Compared. Since methods based on arbitrar- 
ily assigning edge probabilities are dominated by those that 
learn them from the past propagation traces (see Section 3), 
in our evaluation, we focus only on the following models. 

IC model with edge probabilities learnt from the training 
set by means of the EM method [14] . In all the experi- 
ments we run 10k MC simulations. 

LT model with 10k MC simulations. We take ideas from [10] and [7] and learn weights
as $p_{v,u} = A_{v2u}/N$, where $A_{v2u}$ is the number of actions propagated from v
to u in the training set and N is the normalization factor to ensure the sum of
incoming weights on each node is 1.

CD model with direct credit assigned as described in Equation (9). Unless otherwise
mentioned, the truncation threshold λ is set to 0.001 (see Section 5.3). Later we
also study the effect of different truncation thresholds.

Accuracy of Spread Prediction. In Section 3 ("Exper- 
iment 2"), we evaluated methods using IC and LT models 
where edge probabilities are arbitrarily assigned and meth- 
ods that learn them from available data, with respect to the 
accuracy of spread prediction. We conduct a similar exper- 
iment to compare the IC, LT, and CD models. Fig. 3 shows 
the RMSE (computed exactly in the same way as in Section 
3) in the spread predicted by the IC, LT, and CD models, 
as a function of actual spread for both datasets. An in- 
teresting observation from the figure is that while IC beats 
LT by a large margin on Flixster_Small, it loses to LT 
on Flickr_Small by a considerable margin. On the other 
hand, the CD model performs very well on both the datasets. 

In order to have a better understanding of the results, we conduct a detailed
analysis. Fig. 4 depicts the proportion of propagation traces captured within a
given absolute error, which is the absolute difference between the estimated spread
and the actual spread. More precisely, for a given method, a point (x, y) on its
plot says that the fraction of propagation traces (in the test set) on which the
(absolute) prediction error of that method is ≤ x, is y. For instance, on
Flixster_Small, for absolute error ≤ 30, the CD model captures 67% of propagations
(that is, 3391 out of 5128 propagations). On the other hand, the percentages of
propagations captured within the same error by the IC and LT models are 46% and 26%
respectively. Once again, it can be seen that while IC performs better than LT on
Flixster_Small, it is the other way round on Flickr_Small. This plot shows
conclusively that within any given error tolerance, CD is able to capture a much
higher fraction of propagation traces than IC and LT, on both data sets, confirming
that the CD model is more accurate when it comes to predicting the influence spread
of a given seed set.

Figure 3: RMSE vs. propagation size on Flixster_Small (left) and Flickr_Small
(right).

Figure 4: Number of propagations captured against absolute error on Flixster_Small
(left) and Flickr_Small (right).
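The quantity plotted in Fig. 4 is simple to compute once predicted and actual spreads are available for each test trace. A minimal sketch follows; the function name and all numbers are made up for illustration and are not taken from the datasets.

```python
def fraction_within(predicted, actual, max_error):
    """Fraction of traces whose absolute prediction error is at most max_error."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= max_error)
    return hits / len(actual)

predicted = [12.0, 40.0, 95.0, 7.0]   # made-up predicted spreads
actual = [10, 80, 100, 9]             # made-up observed propagation sizes
print(fraction_within(predicted, actual, 30))  # 0.75
```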

Seed Set Intersection. Having established that CD is 
much more accurate in predicting actual spread, we next ex- 
amine the question, how close to each other are the (near) 
optimal seed sets for the influence maximization problem, 
obtained by running the greedy algorithm under different 
models. Fig. 5 shows that the intersection of seed sets ob- 
tained from IC model with the seed sets obtained from LT 
and CD models is empty. On the other hand, there is a 
significant (~50%) overlap between CD and LT models. Note that, since the greedy
algorithm with MC simulations runs too slowly on Flickr_Small (more on this later),
in Fig. 5 we use the PMIA [2] (for IC model) and LDAG [4] (for LT model) heuristics
to obtain the seed sets (only for Flickr_Small), in order to finish this experiment
within a reasonable time (see footnote 5).
This shows that the seed sets obtained from IC and LT mod- 
els are very different from CD model. The difference is much 
more pronounced in the case of IC model. These findings 
are significant since together with the results of the previous 
experiment, they offer some evidence that the seeds chosen 
by IC and LT models run the risk of being poor with re- 
spect to the actual spread they achieve. We strengthen this 
evidence by conducting the next experiment. 

Footnote 5: Chen et al. [2, 4] have shown that the spreads obtained from PMIA and
LDAG are very close to those obtained via MC simulations for IC and LT.

Spread Achieved. In this experiment, we compare the influence spread achieved by the
seed sets obtained from the three methods. For the sake of completeness, we also
include the heuristics High Degree and PageRank, which select as seeds the top-k
nodes with respect to degree and PageRank score respectively, as done in [10, 2].

Figure 5: Size of seed set intersection for k = 50 on Flixster_Small (left) and
Flickr_Small (right).

Figure 6: Influence spread achieved under CD model by seed sets obtained by various
models on Flixster_Small (left) and Flickr_Small (right).

One issue we face is that due to the sparsity issue, we 
cannot determine the actual spread of an arbitrary seed set 
from the available data. The next best thing we can do is 
pick a model that enjoys the least error in spread prediction 
and treat the spread predicted by it as actual spread. In 
this way, for any given seed set, we can use that model to 
tell us (its best estimate of) the actual spread. Given that 
CD model is found to be closest to reality in predicting the 
spread of a seed set (see Fig. 3 and 4), we use the spread 
predicted by it as the actual spread. The results of this 
experiment, depicted in Fig. 6, confirm that on both data 
sets the spread achieved by the seed sets found by methods 
using the IC and LT models falls far short of the actual 
spread, which is best approximated using the CD model. 
A surprising observation is that the IC model performs poorly, even worse than
heuristics like High Degree and PageRank. Inspecting the data, we found that the
seeds picked by the IC model are nodes which perform a very small number of actions,
often just one action, and should not be considered highly influential nodes. We
investigate the reasons below.

For instance, on Flixster_Small, the first seed picked 
by the IC model is the user with Id 168766. While its influ- 
ence spread under IC model is 499.6, it is only 1.08 under 
CD model. In the data, the user 168766 performs only one 
action and this action propagates to 20 of its neighbors. As 
a result, the EM method [14] ends up assigning probability 
1.0 to the edges from 168766 to all its 20 neighbors, making 
it a high influence node, so much that it is picked as the first 
seed. Obviously, in reality, 168766 cannot be considered as a 
highly influential node since its influence is not statistically 
significant. In an analogy with Association Rules, the influ- 
ence of user 168766 can be seen as a maximum confidence 
rule, but which occurs only once (absolute support = 1). 

A deeper analysis tells us that most of the seeds picked by the IC model are of this
kind: not very active nodes that, in the few cases they perform an action, do have
some followers. We checked and found that the average number of actions performed by
these seeds is 30.3 (against the global average of 167). On the other hand, the
average number of actions performed by the seed set obtained from our CD model is
1108.7. We found a similar behavior in Flickr_Small.

Running Time. In this experiment, we first show results 
on the small versions of the data sets, for all three models, 
as a function of number of seed nodes selected. All the 
experiments are run on an Intel(R) Xeon(R) CPU X5570 @ 
2.93GHz machine with 64GB RAM running Opensuse 11.3. 
The algorithms are implemented in C++. 

Fig. 7 reports the time taken (in minutes, on log scale) by the various models. It
can be seen that our method is several orders of magnitude faster. For instance, to
select 50 seeds on Flixster_Small, while the greedy algorithm (with CELF
optimization) takes 40 and 25 hours under the IC and LT models respectively, our
algorithm takes only 3 minutes.

We do not show a similar plot for Flickr_Small as the experiment takes too long to
complete (for IC and LT models). At the time of writing this paper, while the
experiment for the IC model ran for 27 days without even selecting a single seed,
the experiment for the LT model took the same time to pick only 17 seeds. On the
other hand, our algorithm takes only 6 minutes to pick 50 seeds.

Figure 7: Running Time Comparison.

Figure 8: Runtime (left) and memory usage (right) against number of tuples.

Scalability. Next, we show the scalability of our algorithm
with respect to the size of the action log, in number of tuples.
For this purpose, we created the training data set by
randomly choosing propagation traces from the complete action
log and selecting all the corresponding action log tuples. In
Figs. 8 and 9, the x-axis corresponds to the number of tuples
in the training set.
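
The sampling step described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes the action log is a list of (user, action, time) tuples and that one propagation trace corresponds to all tuples sharing the same action; the function name `sample_traces` and the toy log are hypothetical.

```python
import random
from collections import defaultdict

def sample_traces(action_log, num_traces, seed=0):
    """Pick whole propagation traces at random and return all their tuples.

    action_log: iterable of (user, action, time) tuples (assumed layout).
    num_traces: number of distinct actions (traces) to keep.
    """
    # Group tuples by action so that a trace is always kept or dropped whole.
    by_action = defaultdict(list)
    for user, action, time in action_log:
        by_action[action].append((user, action, time))

    rng = random.Random(seed)
    actions = sorted(by_action)  # deterministic order before sampling
    chosen = rng.sample(actions, min(num_traces, len(actions)))
    return [t for a in chosen for t in by_action[a]]

# Toy action log, for illustration only.
log = [("u1", "a", 1), ("u2", "a", 2), ("u1", "b", 1),
       ("u3", "c", 5), ("u2", "c", 6), ("u4", "c", 7)]
training = sample_traces(log, num_traces=2, seed=1)
print(len(training), "tuples in the sampled training set")
```

Sampling by trace rather than by tuple keeps each propagation intact, so the size of the training set in tuples varies with the lengths of the traces that happen to be chosen.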

Fig. 8 (left) shows the time taken by our algorithm to
select 50 seeds against the number of tuples used. It should
be noted that most of the time taken by our algorithm is
consumed in scanning the action log. For example, it takes
15 minutes to select the seed set when 5M tuples are used
on Flixster_Large, of which 11.6 minutes are spent
scanning the action log and only 3.4 minutes are spent
selecting the seed set.

Fig. 8 (right) presents the memory usage with respect to 
the number of action log tuples used to select the seed set 
of size 50. Our algorithm's memory usage is proportional 
to the number of training tuples used: on Flixster_Large 
using 6.5M tuples, it requires approximately 16GB, while 
on 13M tuples on Flickr_Large, it requires approximately 
46GB. This raises the question how much training data is
needed to select a good seed set, which we study next.

Figure 9: Influence spread achieved and number of
"true" seeds with respect to number of tuples used
on Flixster_Large (left) and Flickr_Large (right).

Effect of Training Data Size. Fig. 9 shows the convergence
of the output of our algorithm with respect to the number
of tuples used to select the seeds. Both plots have a double
y-axis (left and right). On the left side, we have the spread
of influence achieved by the seed set obtained using a sample
of the training data, while on the right side, we have the
overlap of the seed set found with the "true seeds", i.e., the
seeds selected by using the complete action logs (all 6.5M
tuples in Flixster_Large and all 13M tuples in Flickr_Large).
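
The overlap measure on the right-hand axis is simply the size of the intersection between two seed sets. A minimal sketch, with a hypothetical function name and made-up user IDs:

```python
def true_seed_overlap(sample_seeds, true_seeds):
    """Count the seeds chosen from a sample that are also "true" seeds."""
    return len(set(sample_seeds) & set(true_seeds))

# Hypothetical user IDs, for illustration only.
true_seeds = [11, 42, 7, 93, 5]     # selected using the complete action log
sample_seeds = [42, 7, 18, 5, 60]   # selected using a sample of the log
overlap = true_seed_overlap(sample_seeds, true_seeds)
print(f"{overlap} of {len(true_seeds)} true seeds recovered")
```

The measure ignores the order in which seeds are picked, so two seed sets with the same members have full overlap.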

As can be seen, the quality of the seed set obtained by
using only 1M tuples is as good as using all 6.5M tuples in
the case of Flixster_Large. Similarly, in Flickr_Large, the
influence spread "converges" after 8M tuples.

These observations suggest that only a small sample of the
propagation traces (or action log) is needed to select the
seed set. As a result, even though our algorithm can in
principle be memory intensive when the action log is huge,
in practice the memory requirements are not that high.

Effect of truncation threshold. Finally, we show the
effect of the truncation threshold λ on the accuracy, memory
usage and running time of our algorithm in Table 4. As
expected, as we decrease the truncation threshold, accuracy
(measured in terms of the number of "true" seeds discovered
and the influence spread achieved) improves, while the memory
requirement and running time increase. Both the influence
spread and the number of "true" seeds discovered essentially
saturate at λ = 0.001. Note that in all our previous
experiments, we used λ = 0.001, which is a good choice as can
be seen from the table. The results on Flickr_Large and on
the small versions of the datasets are similar.
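
To see why a smaller threshold costs memory, consider the toy sketch below. It is an assumption-laden illustration, not the algorithm evaluated above: it assumes that the credit a user passes to an indirect influencer is the product of credits along the path, and that any stored credit below the threshold is dropped. The function `total_credits`, its arguments, and the toy values are all hypothetical.

```python
def total_credits(direct, order, threshold):
    """Accumulate credit for one action, dropping entries below threshold.

    direct[u][w]: direct credit that user u gives to an earlier user w.
    order: users sorted by the time they performed the action.
    Returns credit[u][v]: total (direct + indirect) credit u gives to v.
    """
    credit = {}
    for u in order:
        acc = {}
        for w, g in direct.get(u, {}).items():
            acc[w] = acc.get(w, 0.0) + g              # direct credit
            for v, c in credit.get(w, {}).items():    # credit passed on via w
                acc[v] = acc.get(v, 0.0) + c * g
        # Truncation: keep only entries at or above the threshold.
        credit[u] = {v: c for v, c in acc.items() if c >= threshold}
    return credit

def stored_entries(credit):
    """Number of credit entries kept in memory."""
    return sum(len(row) for row in credit.values())

# Toy chain a -> b -> c -> d with direct credit 0.5 on each hop.
direct = {"b": {"a": 0.5}, "c": {"b": 0.5}, "d": {"c": 0.5}}
order = ["a", "b", "c", "d"]
full = total_credits(direct, order, threshold=0.0)
pruned = total_credits(direct, order, threshold=0.2)
print(stored_entries(full), "entries without truncation")
print(stored_entries(pruned), "entries with threshold 0.2")
```

In this toy chain the truncated run drops the weakest indirect credit (from d back to a), which is the kind of trade-off between stored entries and accuracy that the threshold controls.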

7. CONCLUSIONS AND DISCUSSION 

While most of the literature on influence maximization
has focused mainly on the social graph structure, in this
paper we proposed a novel data-based approach that directly
leverages available traces of past propagations.

Our Credit Distribution model directly estimates influence
spread by exploiting historical data, thus avoiding the
need for learning influence probabilities and, more
importantly, avoiding costly Monte Carlo simulations, the
standard way to estimate influence spread. Based on this, we
developed an efficient algorithm for influence maximization.
We demonstrated its accuracy on real data sets by showing
that the CD model is by far the closest to ground truth. We
also showed that our algorithm is highly scalable.

Beyond the main contributions, this paper achieves several
side-contributions: (1) Methods which arbitrarily assign
influence probabilities suffer from large error in their spread
prediction compared with those that learn these probabilities
from data. (2) The former methods end up choosing
seed sets very different from the latter ones, suggesting the
seeds they recommend may well have a poor spread. (3)
The greedy algorithm using learned influence probabilities
is robust against some noise in the probability learning step.
(4) The IC and LT models, using learned influence probabilities,
choose seed sets very different from each other, and
in turn different from the CD model, which is by far closest
to ground truth. These observations further highlight the
need for devising techniques and benchmarks for comparing
different influence models and the associated influence
maximization methods.

Table 4: Effect of truncation threshold λ on
Flixster_Large. "True seeds" are the ones obtained
when λ = 0.0001.